Abstract

The learning of appropriate distance metrics is a critical problem in image classifi- 
cation and retrieval. In this work, we propose a boosting-based technique, termed 
BoostMetric, for learning a Mahalanobis distance metric. One of the primary 
difficulties in learning such a metric is to ensure that the Mahalanobis matrix re- 
mains positive semidefinite. Semidefinite programming is sometimes used to en- 
force this constraint, but does not scale well. BoostMetric is instead based 
on a key observation that any positive semidefinite matrix can be decomposed 
into a linear positive combination of trace-one rank-one matrices. BoostMet- 
ric thus uses rank-one positive semidefinite matrices as weak learners within an 
efficient and scalable boosting-based learning process. The resulting method is 
easy to implement, does not require tuning, and can accommodate various types 
of constraints. Experiments on various datasets show that the proposed algorithm 
compares favorably to those state-of-the-art methods in terms of classification ac- 
curacy and running time. 



1 Introduction 

It has been an extensively sought-after goal to learn an appropriate distance metric in image
classification and retrieval problems using simple and efficient algorithms [1, 2, 3, 4, 5]. Such
distance metrics are essential to the effectiveness of many critical algorithms such as $k$-nearest
neighbor ($k$NN), $k$-means clustering, and kernel regression, for example. We show in this work how
a Mahalanobis metric is learned from proximity comparisons among triples of training data.
Mahalanobis distance, a.k.a. Gaussian quadratic distance, is parameterized by a positive semidefinite
(p.s.d.) matrix. Therefore, typically methods for learning a Mahalanobis distance result in
constrained semidefinite programs. We discuss the problem setting as well as the difficulties for
learning such a p.s.d. matrix. If we let $\mathbf{a}_i$, $i = 1, 2, \cdots$, represent a set of points
in $\mathbb{R}^D$, the training data consist of a set of constraints upon the relative distances between
these points, $S = \{(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \mid \mathrm{dist}_{ij} < \mathrm{dist}_{ik}\}$,
where $\mathrm{dist}_{ij}$ measures the distance between $\mathbf{a}_i$ and $\mathbf{a}_j$. We are
interested in the case that dist computes the Mahalanobis distance. The Mahalanobis distance between
two vectors takes the form:
$\|\mathbf{a}_i - \mathbf{a}_j\|_{\mathbf{X}} = \sqrt{(\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j)}$,
with $\mathbf{X} \succeq 0$, a p.s.d. matrix. It is equivalent to learn a projection matrix $\mathbf{L}$
and $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$. Constraints such as those above often arise when it is
known that $\mathbf{a}_i$ and $\mathbf{a}_j$ belong to the same class of data points while
$\mathbf{a}_i$, $\mathbf{a}_k$ belong to different classes. In some cases,
these comparison constraints are much easier to obtain than either the class labels or distances be- 
tween data elements. For example, in video content retrieval, faces extracted from successive frames 
at close locations can be safely assumed to belong to the same person, without requiring the indi- 
vidual to be identified. In web search, the results returned by a search engine are ranked according 
to the relevance, an ordering which allows a natural conversion into a set of constraints. 



* NICTA is funded through the Australian Government's Backing Australia's Ability initiative, in part 
through the Australian Research Council. 



The requirement of X being p.s.d. has led to the development of a number of methods for learning 
a Mahalanobis distance which rely upon constrained semidefinite programming. This approach has a
number of limitations, however, which we now discuss with reference to the problem of learning a 
p.s.d. matrix from a set of constraints upon pairwise-distance comparisons. Relevant work on this 
topic includes [3, 4, 5, 6, 7, 8] amongst others. 

Xing et al. [4] first proposed to learn a Mahalanobis metric for clustering using convex
optimization. The inputs are two sets: a similarity set and a dis-similarity set. The algorithm maximizes the
distance between points in the dis-similarity set under the constraint that the distance between points 
in the similarity set is upper-bounded. Neighborhood component analysis (NCA) [6] and large mar- 
gin nearest neighbor (LMNN) [7] learn a metric by maintaining consistency in data's neighborhood 
and keeping a large margin at the boundaries of different classes. It has been shown in [7] that 
LMNN delivers the state-of-the-art performance among most distance metric learning algorithms. 

The work of LMNN [7] and PSDBoost [9] has directly inspired our work. Instead of using hinge 
loss in LMNN and PSDBoost, we use the exponential loss function in order to derive an AdaBoost- 
like optimization procedure. Hence, despite similar purposes, our algorithm differs essentially in 
the optimization. While the formulation of LMNN looks more similar to support vector machines 
(SVM's) and PSDBoost to LPBoost, our algorithm, termed BoostMetric, largely draws upon 
AdaBoost [10]. 

In many cases, it is difficult to find a global optimum in the projection matrix $\mathbf{L}$ [6].
Reformulation-linearization is a typical technique in convex optimization to relax and convexify the
problem [11]. In metric learning, much existing work instead learns $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$
for seeking a global optimum, e.g., [4, 7, 12, 8]. The price is heavy computation and poor
scalability: it is not trivial to preserve the semidefiniteness of $\mathbf{X}$ during the course of
learning. Standard approaches like interior point Newton methods require the Hessian, which usually
requires $O(D^4)$ resources (where $D$ is the input dimension). It could be prohibitive for many
real-world problems. Alternative projected (sub-)gradient is adopted in [7, 4, 8]. The disadvantages
of this algorithm are: (1) not easy to implement; (2) many parameters involved; (3) slow convergence.
PSDBoost [9] converts the particular semidefinite program in metric learning into a sequence of linear
programs (LP's). At each iteration of PSDBoost, an LP needs to be solved as in LPBoost, which scales
around $O(J^{3.5})$ with $J$ the number of iterations (and therefore variables). As $J$ increases, the
scale of the LP becomes larger. Another problem is that PSDBoost needs to store all the weak learners
(the rank-one matrices) during the optimization. When the input dimension $D$ is large, the memory
required is proportional to $JD^2$, which can be prohibitively huge at a late iteration $J$. Our
proposed algorithm solves both of these problems.

Based on the observation from [9] that any positive semidefinite matrix can be decomposed into a 
linear positive combination of trace-one rank-one matrices, we propose BoostMetric for learning 
a p.s.d. matrix. The weak learner of BoostMetric is a rank-one p.s.d. matrix as in PSDBoost. 
The proposed BoostMetric algorithm has the following desirable properties: (1) BoostMetric 
is efficient and scalable. Unlike most existing methods, no semidefinite programming is required. 
At each iteration, only the largest eigenvalue and its corresponding eigenvector are needed. (2) 
BoostMetric can accommodate various types of constraints. We demonstrate learning a Mahalanobis
metric by proximity comparison constraints. (3) Like AdaBoost, BoostMetric does not
have any parameter to tune. The user only needs to know when to stop. In contrast, both LMNN 
and PSDBoost have parameters to cross validate. Also like AdaBoost it is easy to implement. No 
sophisticated optimization techniques such as LP solvers are involved. Unlike PSDBoost, we do not 
need to store all the weak learners. The efficacy and efficiency of the proposed BoostMetric is 
demonstrated on various datasets. 

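Throughout this paper, a matrix is denoted by a bold upper-case letter ($\mathbf{X}$); a column vector
is denoted by a bold lower-case letter ($\mathbf{x}$). The $i$th row of $\mathbf{X}$ is denoted by
$\mathbf{X}_{i:}$ and the $i$th column $\mathbf{X}_{:i}$. $\mathrm{Tr}(\cdot)$ is the trace of a
symmetric matrix and $\langle \mathbf{X}, \mathbf{Z} \rangle = \mathrm{Tr}(\mathbf{X}\mathbf{Z}^\top) = \sum_{ij} X_{ij} Z_{ij}$
calculates the inner product of two matrices. An element-wise inequality between two vectors like
$\mathbf{u} \leq \mathbf{v}$ means $u_i \leq v_i$ for all $i$. We use $\mathbf{X} \succeq 0$ to indicate
that matrix $\mathbf{X}$ is positive semidefinite.

For a matrix $\mathbf{X} \in \mathbb{S}^D$, the following statements are equivalent: (1)
$\mathbf{X} \succeq 0$ ($\mathbf{X} \in \mathbb{S}^D_+$); (2) All eigenvalues of $\mathbf{X}$ are
nonnegative ($\lambda_i(\mathbf{X}) \geq 0$, $i = 1, \cdots, D$); and (3)
$\forall \mathbf{u} \in \mathbb{R}^D$, $\mathbf{u}^\top \mathbf{X} \mathbf{u} \geq 0$.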



2 Algorithms 



In this section, we define the mathematical problems ((P0), (P1)) we want to solve. In order to
derive an efficient optimization strategy, we investigate the dual problem ((D1)) as well from a
convex optimization viewpoint.

2.1 Distance Metric Learning 

As discussed, the Mahalanobis metric is equivalent to linearly transforming the data by a projection
matrix $\mathbf{L} \in \mathbb{R}^{D \times d}$ (usually $D > d$) before calculating the standard
Euclidean distance:

$$ \mathrm{dist}_{ij}^2 = \|\mathbf{L}^\top \mathbf{a}_i - \mathbf{L}^\top \mathbf{a}_j\|_2^2 = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{L}\mathbf{L}^\top (\mathbf{a}_i - \mathbf{a}_j) = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j). \tag{1} $$

Although one can learn $\mathbf{L}$ directly as many conventional approaches do, in this setting,
non-convex constraints are involved, which make the problem difficult to solve. As we will show, in
order to convexify these conditions, a new variable $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$ is
introduced instead. This technique has been used widely in convex optimization and machine learning
such as [12]. If $\mathbf{X} = \mathbf{I}$, it reduces to the Euclidean distance. If $\mathbf{X}$ is
diagonal, the problem corresponds to learning a metric in which the different features are given
different weights, a.k.a. feature weighting.
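As a numerical illustration of (1), the following minimal sketch checks that the squared Euclidean distance after projection by $\mathbf{L}^\top$ coincides with the quadratic form in $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$, and that $\mathbf{X} = \mathbf{I}$ gives the Euclidean distance. The sizes, the random data and the function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 2                        # illustrative sizes with D > d
L = rng.standard_normal((D, d))    # hypothetical projection matrix
X = L @ L.T                        # induced p.s.d. matrix X = L L^T
a_i = rng.standard_normal(D)
a_j = rng.standard_normal(D)


def sq_dist_projected(L, a, b):
    """Squared Euclidean distance after projecting both points with L^T."""
    diff = L.T @ a - L.T @ b
    return float(diff @ diff)


def sq_dist_mahalanobis(X, a, b):
    """Squared Mahalanobis distance (a - b)^T X (a - b)."""
    diff = a - b
    return float(diff @ X @ diff)


lhs = sq_dist_projected(L, a_i, a_j)
rhs = sq_dist_mahalanobis(X, a_i, a_j)
euclid = sq_dist_mahalanobis(np.eye(D), a_i, a_j)   # X = I
print(lhs, rhs, euclid)
```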

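In the framework of large-margin learning, we want to maximize the distance between
$\mathrm{dist}_{ij}$ and $\mathrm{dist}_{ik}$. That is, we wish to make
$\mathrm{dist}_{ik}^2 - \mathrm{dist}_{ij}^2 = (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j)$
as large as possible under some regularization. To simplify notation, we rewrite the distance between
$\mathrm{dist}_{ij}^2$ and $\mathrm{dist}_{ik}^2$ as
$\mathrm{dist}_{ik}^2 - \mathrm{dist}_{ij}^2 = \langle \mathbf{A}_r, \mathbf{X} \rangle$,

$$ \mathbf{A}_r = (\mathbf{a}_i - \mathbf{a}_k)(\mathbf{a}_i - \mathbf{a}_k)^\top - (\mathbf{a}_i - \mathbf{a}_j)(\mathbf{a}_i - \mathbf{a}_j)^\top, \tag{2} $$

$r = 1, \cdots, |S|$. $|S|$ is the size of the set $S$.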
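The identity behind (2) can be checked numerically. The sketch below builds $\mathbf{A}_r$ for one made-up triplet and confirms that $\langle \mathbf{A}_r, \mathbf{X} \rangle$ equals the difference of the two squared distances; the helper names `triplet_matrix` and `inner` are our own and not part of any released code.

```python
import numpy as np


def triplet_matrix(a_i, a_j, a_k):
    """A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T, as in (2)."""
    d_ik = a_i - a_k
    d_ij = a_i - a_j
    return np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij)


def inner(A, X):
    """Matrix inner product <A, X> = Tr(A X^T) = sum_ij A_ij X_ij."""
    return float(np.sum(A * X))


rng = np.random.default_rng(1)
D = 4
a_i, a_j, a_k = rng.standard_normal((3, D))   # one made-up triplet
M = rng.standard_normal((D, D))
X = M @ M.T                                   # an arbitrary p.s.d. matrix

A_r = triplet_matrix(a_i, a_j, a_k)
margin = inner(A_r, X)
direct = float((a_i - a_k) @ X @ (a_i - a_k) - (a_i - a_j) @ X @ (a_i - a_j))
print(margin, direct)
```

Note that $\mathbf{A}_r$ is symmetric but in general indefinite, since it is a difference of two rank-one p.s.d. matrices.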

2.2 Learning with Exponential Loss 

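We derive a general algorithm for p.s.d. matrix learning with exponential loss. Assume that we want
to find a p.s.d. matrix $\mathbf{X} \succeq 0$ such that a bunch of constraints

$$ \langle \mathbf{A}_r, \mathbf{X} \rangle > 0, \quad r = 1, 2, \cdots, $$

are satisfied as well as possible. These constraints need not be all strictly satisfied. We can define
the margin $\rho_r = \langle \mathbf{A}_r, \mathbf{X} \rangle$, $\forall r$. By employing exponential
loss, we want to optimize

$$ \min \ \log\Big(\sum_{r=1}^{|S|} \exp(-\rho_r)\Big) + v\,\mathrm{Tr}(\mathbf{X}) \quad \text{s.t.} \ \rho_r = \langle \mathbf{A}_r, \mathbf{X} \rangle, \ r = 1, \cdots, |S|, \ \mathbf{X} \succeq 0. \tag{P0} $$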

Note that: (1) We have worked on the logarithmic version of the sum of exponential loss. This
transform does not change the original optimization problem of sum of exponential loss because
the logarithmic function is strictly monotonically increasing. (2) A regularization term
$\mathrm{Tr}(\mathbf{X})$ has been applied. Without this regularization, one can always multiply an
arbitrarily large factor to $\mathbf{X}$ to make the exponential loss approach zero in the case of all
constraints being satisfied. This trace-norm regularization may also lead to low-rank solutions. (3)
An auxiliary variable $\rho_r$, $r = 1, \cdots$, must be introduced for deriving a meaningful dual
problem, as we show later.
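Point (2) can be made concrete with a small sketch. It evaluates the objective of (P0) on two made-up triplets whose constraints are satisfied by $\mathbf{X} = \mathbf{I}$, and scales $\mathbf{X}$ by a growing factor. The value $v = 2$ used for the regularized case is an arbitrary choice for illustration, far larger than the $10^{-7}$ used in our experiments; the function name `p0_objective` is likewise hypothetical.

```python
import numpy as np


def triplet_matrix(a_i, a_j, a_k):
    """A_r from (2)."""
    d_ik, d_ij = a_i - a_k, a_i - a_j
    return np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij)


def p0_objective(X, A_list, v):
    """log(sum_r exp(-<A_r, X>)) + v * Tr(X), with a stable log-sum-exp."""
    rho = np.array([np.sum(A * X) for A in A_list])   # margins rho_r
    m = np.max(-rho)
    return float(m + np.log(np.sum(np.exp(-rho - m))) + v * np.trace(X))


# Two made-up triplets (a_i, a_j, a_k); a_j is closer to a_i than a_k is.
A_list = [
    triplet_matrix(np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([0.0, 2.0])),
    triplet_matrix(np.array([1.0, 1.0]), np.array([1.0, 1.5]), np.array([3.0, 1.0])),
]
X = np.eye(2)
scales = (1.0, 10.0, 100.0)

# Without regularization, scaling X keeps lowering the objective.
unreg = [p0_objective(c * X, A_list, v=0.0) for c in scales]
# With a (deliberately large) trace penalty, scaling X is no longer free.
reg = [p0_objective(c * X, A_list, v=2.0) for c in scales]
print(unreg, reg)
```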

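We can decompose $\mathbf{X}$ into: $\mathbf{X} = \sum_{j=1}^{J} w_j \mathbf{Z}_j$, with $w_j \geq 0$,
$\mathrm{rank}(\mathbf{Z}_j) = 1$ and $\mathrm{Tr}(\mathbf{Z}_j) = 1$, $\forall j$. So

$$ \rho_r = \langle \mathbf{A}_r, \mathbf{X} \rangle = \Big\langle \mathbf{A}_r, \sum_{j=1}^{J} w_j \mathbf{Z}_j \Big\rangle = \sum_{j=1}^{J} w_j \langle \mathbf{A}_r, \mathbf{Z}_j \rangle = \sum_{j=1}^{J} w_j H_{rj} = \mathbf{H}_{r:}\mathbf{w}, \ \forall r. \tag{3} $$

Here $H_{rj}$ is a shorthand for $H_{rj} = \langle \mathbf{A}_r, \mathbf{Z}_j \rangle$. Clearly
$\mathrm{Tr}(\mathbf{X}) = \sum_{j=1}^{J} w_j \mathrm{Tr}(\mathbf{Z}_j) = \mathbf{1}^\top \mathbf{w}$.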
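One way to obtain such a decomposition for a given p.s.d. matrix is its eigendecomposition: each unit eigenvector $\boldsymbol{\xi}$ yields a trace-one rank-one matrix $\boldsymbol{\xi}\boldsymbol{\xi}^\top$, weighted by its nonnegative eigenvalue. The sketch below illustrates this on random data; the function name and the tolerance are assumptions, and the decomposition is not unique.

```python
import numpy as np


def trace_one_rank_one_decomposition(X, tol=1e-12):
    """Return weights w_j > 0 and matrices Z_j (rank one, trace one) with
    X = sum_j w_j Z_j, read off from the eigendecomposition of X."""
    eigvals, eigvecs = np.linalg.eigh(X)
    weights, bases = [], []
    for lam, xi in zip(eigvals, eigvecs.T):
        if lam > tol:                         # skip (numerically) zero eigenvalues
            weights.append(lam)
            bases.append(np.outer(xi, xi))    # unit-norm xi gives trace one
    return np.array(weights), bases


rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))
X = M @ M.T                                   # a 4 x 4 p.s.d. matrix of rank at most 3

w, Z = trace_one_rank_one_decomposition(X)
X_rebuilt = sum(w_j * Z_j for w_j, Z_j in zip(w, Z))
print(w, np.trace(X))
```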

2.3 The Lagrange Dual Problem 

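We now derive the Lagrange dual of the problem we are interested in. The original problem (P0)
now becomes

$$ \min \ \log\Big(\sum_{r=1}^{|S|} \exp(-\rho_r)\Big) + v\,\mathbf{1}^\top \mathbf{w}, \quad \text{s.t.} \ \rho_r = \mathbf{H}_{r:}\mathbf{w}, \ r = 1, \cdots, |S|, \ \mathbf{w} \geq 0. \tag{P1} $$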



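In order to derive its dual, we write its Lagrangian

$$ L(\mathbf{w}, \boldsymbol{\rho}, \mathbf{u}, \mathbf{p}) = \log\Big(\sum_{r=1}^{|S|} \exp(-\rho_r)\Big) + v\,\mathbf{1}^\top \mathbf{w} + \sum_{r=1}^{|S|} u_r (\rho_r - \mathbf{H}_{r:}\mathbf{w}) - \mathbf{p}^\top \mathbf{w}, \tag{4} $$

with $\mathbf{p} \geq 0$. Here $\mathbf{u}$ and $\mathbf{p}$ are Lagrange multipliers. The dual problem
is obtained by finding the saddle point of $L$; i.e.,
$\sup_{\mathbf{u}, \mathbf{p}} \inf_{\mathbf{w}, \boldsymbol{\rho}} L$.

$$ \inf_{\mathbf{w}, \boldsymbol{\rho}} L = \inf_{\boldsymbol{\rho}} \underbrace{\log\Big(\sum_{r=1}^{|S|} \exp(-\rho_r)\Big) + \mathbf{u}^\top \boldsymbol{\rho}}_{L_1} + \inf_{\mathbf{w}} \underbrace{\Big(v\,\mathbf{1}^\top - \sum_{r=1}^{|S|} u_r \mathbf{H}_{r:} - \mathbf{p}^\top\Big)\mathbf{w}}_{L_2} = -\sum_{r=1}^{|S|} u_r \log u_r. $$

The infimum of $L_1$ is found by setting its first derivative to zero and we have:

$$ \inf_{\boldsymbol{\rho}} L_1 = \begin{cases} -\sum_r u_r \log u_r & \text{if } \mathbf{u} \geq 0, \ \mathbf{1}^\top \mathbf{u} = 1, \\ -\infty & \text{otherwise.} \end{cases} $$

The infimum is Shannon entropy. $L_2$ is linear in $\mathbf{w}$, hence $L_2$ must be 0. It leads to

$$ \sum_{r=1}^{|S|} u_r \mathbf{H}_{r:} \leq v\,\mathbf{1}^\top. \tag{5} $$

The Lagrange dual problem of (P1) is an entropy maximization problem, which writes

$$ \max_{\mathbf{u}} \ -\sum_{r=1}^{|S|} u_r \log u_r, \quad \text{s.t.} \ \mathbf{u} \geq 0, \ \mathbf{1}^\top \mathbf{u} = 1, \ \text{and (5)}. \tag{D1} $$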

Weak and strong duality hold under mild conditions [11]. That means, one can usually solve one
problem from the other. The KKT conditions link the optimal solutions of these two problems. In our
case, it is

$$ u_r^\star = \frac{\exp(-\rho_r^\star)}{\sum_{k=1}^{|S|} \exp(-\rho_k^\star)}, \ \forall r. \tag{6} $$
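The link between (6) and the entropy in (D1) can be verified numerically. For an arbitrary margin vector $\boldsymbol{\rho}$, the sketch below computes $\mathbf{u}$ from (6) and checks that $L_1$ evaluated at that $\boldsymbol{\rho}$ equals the Shannon entropy of $\mathbf{u}$, and that perturbing $\boldsymbol{\rho}$ with $\mathbf{u}$ held fixed does not lower $L_1$. The margins are made-up numbers and the function names are ours.

```python
import numpy as np


def u_from_margins(rho):
    """u_r = exp(-rho_r) / sum_k exp(-rho_k), as in (6), computed stably."""
    e = np.exp(-rho - np.max(-rho))
    return e / e.sum()


def l1_value(rho, u):
    """L_1 = log(sum_r exp(-rho_r)) + u^T rho."""
    m = np.max(-rho)
    return float(m + np.log(np.sum(np.exp(-rho - m))) + u @ rho)


rho = np.array([0.3, -1.2, 2.0, 0.7])          # arbitrary margins
u = u_from_margins(rho)

at_stationary_point = l1_value(rho, u)
entropy = float(-np.sum(u * np.log(u)))
perturbed = l1_value(rho + np.array([0.5, -0.3, 0.1, 0.2]), u)
print(at_stationary_point, entropy, perturbed)
```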



While it is possible to devise a totally-corrective column generation based optimization procedure 
for solving our problem as the case of LPBoost [13], we are more interested in considering one-at- 
a-time coordinate-wise descent algorithms, as the case of AdaBoost [10], which has the advantages: 
(1) computationally efficient and (2) parameter free. Let us start from some basic knowledge of 
column generation because our coordinate descent strategy is inspired by column generation. 

If we knew all the bases $\mathbf{Z}_j$ ($j = 1 \ldots J$) and hence the entire matrix $\mathbf{H}$ is
known, then either the primal (P1) or the dual (D1) could be trivially solved (at least in theory)
because both are convex optimization problems. We can solve them in polynomial time. Especially the
primal problem is convex minimization with simple nonnegativeness constraints. Off-the-shelf software
like LBFGS-B [14] can be used for this purpose. Unfortunately, in practice, we do not access all the
bases: the number of possible $\mathbf{Z}$'s is infinite. In convex optimization, column generation is
a technique that is designed for solving this difficulty.

Instead of directly solving the primal problem (P1), we find the most violated constraint in the dual
(D1) iteratively for the current solution and add this constraint to the optimization problem. For this
purpose, we need to solve

$$ \hat{\mathbf{Z}} = \arg\max_{\mathbf{Z}} \Big\{ \sum_{r=1}^{|S|} u_r \langle \mathbf{A}_r, \mathbf{Z} \rangle, \ \text{s.t.} \ \mathbf{Z} \in \Omega_1 \Big\}. \tag{7} $$

Here $\Omega_1$ is the set of trace-one rank-one matrices. We discuss how to efficiently solve (7) later.
Now we move on to derive a coordinate descent optimization procedure.



2.4 Coordinate Descent Optimization 

We show how an AdaBoost-like optimization procedure can be derived for our metric learning problem.
As in AdaBoost, we need to solve for the primal variables $w_j$ given all the weak learners up to
iteration $j$.



Algorithm 1 Bisection search for $w_j$.

Input: An interval $[w_l, w_u]$ known to contain the optimal value of $w_j$ and convergence tolerance $\varepsilon > 0$.
repeat
    • $w_j = 0.5(w_l + w_u)$;
    • if l.h.s. of (8) > 0 then
          $w_l = w_j$;
      else
          $w_u = w_j$.
until $w_u - w_l < \varepsilon$;
Output: $w_j$.



Optimizing for $w_j$ Since we are interested in the one-at-a-time coordinate-wise optimization, we
keep $w_1, w_2, \cdots, w_{j-1}$ fixed when solving for $w_j$. The cost function of the primal problem
is (in the following derivation, we drop those terms irrelevant to the variable $w_j$)

$$ C_p(w_j) = \log\Big[\sum_{r=1}^{|S|} \exp(-\rho_r^{j-1}) \cdot \exp(-H_{rj} w_j)\Big] + v\,w_j, $$

where $\rho_r^{j-1}$ denotes the margin of the $r$th constraint after the first $j - 1$ weak learners.
Clearly, $C_p$ is convex in $w_j$ and hence there is only one minimum that is also globally optimal. The
first derivative of $C_p$ w.r.t. $w_j$ vanishes at optimality, which results in

$$ \sum_{r=1}^{|S|} (H_{rj} - v)\, u_r^{j-1} \exp(-w_j H_{rj}) = 0. \tag{8} $$

If $H_{rj}$ is discrete, such as $\{+1, -1\}$ in standard AdaBoost, we can obtain a closed-form solution
similar to AdaBoost. Unfortunately in our case, $H_{rj}$ can be any real value. We instead use bisection
to search for the optimal $w_j$. The bisection method is one of the root-finding algorithms. It
repeatedly divides an interval in half and then selects the subinterval in which a root exists.
Bisection is simple and robust, although it is not the fastest algorithm for root-finding. Newton-type
algorithms are also applicable here. Algorithm 1 gives the bisection procedure. We have utilized the
fact that the l.h.s. of (8) must be positive at $w_l$. Otherwise no solution can be found. When
$w_j \to 0$, clearly the l.h.s. of (8) is positive.
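The bisection procedure of Algorithm 1 can be sketched in a few lines. The version below uses only the standard library; the doubling step that grows the upper end of the interval until the sign of (8) changes is an assumption of this sketch (Algorithm 1 takes the interval as given), and the numbers in the usage example are made up.

```python
import math


def lhs_eq8(w, H_col, u_prev, v):
    """Left-hand side of (8): sum_r (H_rj - v) u_r^{j-1} exp(-w H_rj)."""
    return sum((h - v) * u * math.exp(-w * h) for h, u in zip(H_col, u_prev))


def bisection_weight(H_col, u_prev, v, w_lo=0.0, w_hi=1.0, eps=1e-10):
    """Bisection search for the root of (8), following Algorithm 1."""
    if lhs_eq8(w_lo, H_col, u_prev, v) <= 0:
        raise ValueError("l.h.s. of (8) must be positive at the lower end")
    # Assumption of this sketch: enlarge the interval until the sign changes.
    while lhs_eq8(w_hi, H_col, u_prev, v) > 0:
        w_hi *= 2.0
        if w_hi > 1e3:
            raise ValueError("no sign change found; (8) may have no root")
    while w_hi - w_lo >= eps:
        w_mid = 0.5 * (w_lo + w_hi)
        if lhs_eq8(w_mid, H_col, u_prev, v) > 0:
            w_lo = w_mid
        else:
            w_hi = w_mid
    return 0.5 * (w_lo + w_hi)


H_col = [1.0, -0.5, 0.8]            # made-up values of H_rj for one weak learner
u_prev = [1.0 / 3.0] * 3            # uniform u^{j-1}
v = 1e-7
w_star = bisection_weight(H_col, u_prev, v)
print(w_star, lhs_eq8(w_star, H_col, u_prev, v))
```

A root exists in this example because one entry of `H_col` is negative; if every $H_{rj}$ exceeds $v$, the l.h.s. of (8) stays positive and the search has to be bounded in some other way.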

Updating $\mathbf{u}$ The rule for updating the dual variable $\mathbf{u}$ can be easily obtained from
(6). At iteration $j$, we have

$$ u_r^{j} \propto \exp(-\rho_r^{j}) \propto u_r^{j-1} \exp(-H_{rj} w_j), \quad \text{and} \quad \sum_{r=1}^{|S|} u_r^{j} = 1, $$

derived from (6). So once $w_j$ is calculated, we can update $\mathbf{u}$ as

$$ u_r^{j} = \frac{u_r^{j-1} \exp(-H_{rj} w_j)}{z}, \quad r = 1, \cdots, |S|, \tag{9} $$

where $z$ is a normalization factor so that $\sum_{r=1}^{|S|} u_r^{j} = 1$. This is exactly the same as
AdaBoost.
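A minimal sketch of the update (9) follows, using only the standard library. The values of $H_{rj}$ and $w_j$ are made up; the point is that constraints with a small or negative margin contribution receive more weight in the next round.

```python
import math


def update_u(u_prev, H_col, w_j):
    """Multiplicative update (9): u_r <- u_r * exp(-H_rj * w_j) / z."""
    unnormalized = [u * math.exp(-h * w_j) for u, h in zip(u_prev, H_col)]
    z = sum(unnormalized)                    # normalization factor
    return [x / z for x in unnormalized]


u_prev = [0.25, 0.25, 0.25, 0.25]
H_col = [1.0, 0.5, -0.5, 0.0]                # made-up values of H_rj
u_new = update_u(u_prev, H_col, w_j=0.8)
print(u_new)
```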



2.5 Base Learning Algorithm 

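In this section, we show that the optimization problem (7) can be exactly and efficiently solved using
eigenvalue-decomposition (EVD). From $\mathbf{Z} \succeq 0$ and $\mathrm{rank}(\mathbf{Z}) = 1$, we know
that $\mathbf{Z}$ has the format: $\mathbf{Z} = \boldsymbol{\xi}\boldsymbol{\xi}^\top$,
$\boldsymbol{\xi} \in \mathbb{R}^D$; and $\mathrm{Tr}(\mathbf{Z}) = 1$ means $\|\boldsymbol{\xi}\|_2 = 1$.
We have

$$ \Big\langle \sum_{r=1}^{|S|} u_r \mathbf{A}_r, \mathbf{Z} \Big\rangle = \boldsymbol{\xi}^\top \Big(\sum_{r=1}^{|S|} u_r \mathbf{A}_r\Big) \boldsymbol{\xi}. $$

By denoting

$$ \hat{\mathbf{A}} = \sum_{r=1}^{|S|} u_r \mathbf{A}_r, \tag{10} $$

the base learning optimization equals: $\max_{\boldsymbol{\xi}} \boldsymbol{\xi}^\top \hat{\mathbf{A}} \boldsymbol{\xi}$,
s.t. $\|\boldsymbol{\xi}\|_2 = 1$. It is clear that the largest eigenvalue of $\hat{\mathbf{A}}$,
$\lambda_{\max}(\hat{\mathbf{A}})$, and its corresponding eigenvector $\boldsymbol{\xi}_1$ gives the
solution to the above problem. Note that $\hat{\mathbf{A}}$ is symmetric. Also see [9] for details.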
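The base learner can be sketched as follows. For brevity the sketch calls a full symmetric eigensolver and keeps only the last eigenpair; an iterative solver that returns just the leading eigenpair would be the natural choice for large $D$. The random symmetric matrices stand in for the $\mathbf{A}_r$ and, like the function name, are assumptions of the example.

```python
import numpy as np


def base_learner(A_hat):
    """Solve (7): return (lambda_max, Z) where Z = xi xi^T and xi is the unit
    eigenvector of A_hat belonging to its largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(A_hat)   # eigenvalues in ascending order
    xi = eigvecs[:, -1]
    return float(eigvals[-1]), np.outer(xi, xi)


rng = np.random.default_rng(3)
D, n = 4, 6
A_list = []
for _ in range(n):
    B = rng.standard_normal((D, D))
    A_list.append(0.5 * (B + B.T))             # made-up symmetric stand-ins for A_r
u = np.full(n, 1.0 / n)

A_hat = sum(u_r * A_r for u_r, A_r in zip(u, A_list))   # equation (10)
lam_max, Z = base_learner(A_hat)

# No other unit vector should give a larger value of xi^T A_hat xi.
trial = rng.standard_normal((100, D))
trial /= np.linalg.norm(trial, axis=1, keepdims=True)
trial_values = np.einsum("nd,de,ne->n", trial, A_hat, trial)
print(lam_max, trial_values.max())
```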

$\lambda_{\max}(\hat{\mathbf{A}})$ is also used as one of the stopping criteria of the algorithm. From
the condition (5), $\lambda_{\max}(\hat{\mathbf{A}}) < v$ means that we are not able to find a new base
matrix $\hat{\mathbf{Z}}$ that violates (5); that is, the algorithm has converged. We summarize our main
algorithmic results in Algorithm 2.



Algorithm 2 Positive semidefinite matrix learning with boosting.

Input:
    • Training set triplets $(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \in S$; Compute $\mathbf{A}_r$, $r = 1, 2, \cdots$, using (2).
    • $J$: maximum number of iterations;
    • (optional) regularization parameter $v$; We may simply set $v$ to a very small value, e.g., $10^{-7}$.
Initialize: $u_r^0 = \frac{1}{|S|}$, $r = 1 \cdots |S|$;
for $j = 1, 2, \cdots, J$ do
    • Find a new base $\mathbf{Z}_j$ by finding the largest eigenvalue ($\lambda_{\max}(\hat{\mathbf{A}})$) and its eigenvector of $\hat{\mathbf{A}}$ in (10);
    • if $\lambda_{\max}(\hat{\mathbf{A}}) < v$ then
          break (converged);
    • Compute $w_j$ using Algorithm 1;
    • Update $\mathbf{u}$ to obtain $u_r^j$, $r = 1, \cdots, |S|$ using (9);
Output: The final p.s.d. matrix $\mathbf{X} \in \mathbb{R}^{D \times D}$, $\mathbf{X} = \sum_{j=1}^{J} w_j \mathbf{Z}_j$.
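Putting the pieces together, Algorithm 2 can be sketched as below. This is an illustrative NumPy sketch, not our released Matlab code. Three choices are assumptions of the sketch: the dual variables are kept in the log domain for numerical stability, the step $w_j$ is capped at `w_max` so that it stays finite when (8) has no root, and a full eigensolver is used where only the leading eigenpair is needed. The toy data (one informative dimension, one noisy dimension, and a few deliberately swapped triplets) are made up.

```python
import numpy as np


def triplet_matrices(points, triplets):
    """Stack the matrices A_r of (2) for index triplets (i, j, k)."""
    mats = []
    for i, j, k in triplets:
        d_ik = points[i] - points[k]
        d_ij = points[i] - points[j]
        mats.append(np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij))
    return np.stack(mats)


def scaled_lhs_eq8(w, h, log_u, v):
    """L.h.s. of (8) up to a positive factor, evaluated in the log domain."""
    t = log_u - w * h
    return float(np.sum((h - v) * np.exp(t - t.max())))


def bisection(h, log_u, v, w_max=10.0, eps=1e-8):
    """Algorithm 1; the cap w_max is an assumption of this sketch."""
    w_lo, w_hi = 0.0, 1.0
    while scaled_lhs_eq8(w_hi, h, log_u, v) > 0 and w_hi < w_max:
        w_hi = min(2.0 * w_hi, w_max)
    while w_hi - w_lo >= eps:
        w_mid = 0.5 * (w_lo + w_hi)
        if scaled_lhs_eq8(w_mid, h, log_u, v) > 0:
            w_lo = w_mid
        else:
            w_hi = w_mid
    return 0.5 * (w_lo + w_hi)


def boost_metric(A, v=1e-7, J=50):
    """Algorithm 2 on precomputed A_r (array of shape |S| x D x D)."""
    n, D, _ = A.shape
    log_u = np.full(n, -np.log(n))                      # u_r^0 = 1 / |S|
    X = np.zeros((D, D))
    for _ in range(J):
        A_hat = np.tensordot(np.exp(log_u), A, axes=1)  # equation (10)
        eigvals, eigvecs = np.linalg.eigh(A_hat)
        if eigvals[-1] < v:                             # stopping criterion from (5)
            break
        xi = eigvecs[:, -1]
        Z = np.outer(xi, xi)                            # new trace-one rank-one base
        h = np.einsum("rij,ij->r", A, Z)                # H_rj = <A_r, Z_j>
        w = bisection(h, log_u, v)
        X += w * Z
        log_u = log_u - w * h                           # update (9), log domain
        m = log_u.max()
        log_u -= m + np.log(np.sum(np.exp(log_u - m)))  # renormalize to sum one
    return X


def objective(X, A, v):
    """Objective of (P0): log(sum_r exp(-<A_r, X>)) + v Tr(X)."""
    rho = np.einsum("rij,ij->r", A, X)
    m = np.max(-rho)
    return float(m + np.log(np.sum(np.exp(-rho - m))) + v * np.trace(X))


# Toy data: dimension 0 separates two classes, dimension 1 is noise.
rng = np.random.default_rng(0)
n_per_class = 20
x0 = np.concatenate([rng.uniform(-0.2, 0.2, n_per_class),
                     3.0 + rng.uniform(-0.2, 0.2, n_per_class)])
x1 = rng.uniform(-2.0, 2.0, 2 * n_per_class)
points = np.column_stack([x0, x1])

triplets = []
for r in range(40):
    i = r % n_per_class
    j = (i + 1 + r // n_per_class) % n_per_class        # same class as i
    k = n_per_class + i                                 # other class
    # Every tenth triplet is swapped on purpose, so not all constraints can hold.
    triplets.append((i, k, j) if r % 10 == 0 else (i, j, k))

A = triplet_matrices(points, triplets)
X = boost_metric(A, v=1e-7, J=50)
print(np.linalg.eigvalsh(X), objective(X, A, 1e-7), np.log(len(triplets)))
```

Only the running sum $\mathbf{X}$ and the current $\mathbf{u}$ are kept between iterations, which reflects the point made earlier that the weak learners need not be stored.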



Figure 1: The data are projected into 2D with PCA (left), LDA (middle) and BoostMetric (right). Both
PCA and LDA fail to recover the data structure. The local structure of the data is preserved after projection by 
BoostMetric. 



3 Experiments 

In this section, we present experiments on data visualization, classification and image retrieval tasks. 



3.1 An Illustrative Example 

We demonstrate a data visualization problem on an artificial toy dataset (concentric circles) in Fig. 1.
The dataset has four classes. The first two dimensions follow concentric circles while the remaining
eight dimensions are all random Gaussian noise. In this experiment, 9000 triplets are generated for
training. When the scale of the noise is large, PCA fails to find the first two informative dimensions.
LDA fails too because clearly each class does not follow a Gaussian distribution and their centers
overlap at the same point. The proposed BoostMetric algorithm finds the informative features.
The eigenvalues of $\mathbf{X}$ learned by BoostMetric are $\{0.542, 0.414, 0.007, 0, \cdots, 0\}$, which
indicates that BoostMetric successfully reveals the data's underlying structure.



3.2 Classification on Benchmark Datasets 

We evaluate BoostMetric on 15 datasets of different sizes. Some of the datasets have very high
dimensional inputs. We use PCA to decrease the dimensionality before training on these datasets
(datasets 2-6). PCA pre-processing helps to eliminate noise and speed up computation. Table 1
summarizes the datasets in detail. We have used USPS and MNIST handwritten digits, ORL face
recognition datasets, Columbia University Image Library (COIL20)¹, and UCI machine learning
datasets² (datasets 7-13), Twin Peaks and Helix. The last two are artificial datasets³.

Table 1: Datasets used in the experiment. We report computational time of BoostMetric on each dataset.
A blank "dim. after PCA" entry means that no PCA is applied.

|    | dataset       | # train | # test | input dim. | dim. after PCA | # classes | # runs | time per run |
|----|---------------|---------|--------|------------|----------------|-----------|--------|--------------|
| 1  | USPS-1        | 5,500   | 5,500  | 256        |                | 10        | 1      | 0.8h         |
| 2  | USPS-2        | 7,700   | 3,300  | 256        | 64             | 10        | 10     | 1m           |
| 3  | ORLFace-1     | 340     | 60     | 2,576      | 128            | 40        | 10     | 15s          |
| 4  | ORLFace-2     | 340     | 60     | 2,576      | 42             | 40        | 10     | 10s          |
| 5  | MNIST         | 7,000   | 3,000  | 784        | 20             | 10        | 10     | 38s          |
| 6  | COIL20        | 1,008   | 432    | 1,024      | 100            | 20        | 10     | 2s           |
| 7  | Letters       | 10,500  | 4,500  | 16         |                | 26        | 10     | 141s         |
| 8  | Wine          | 142     | 36     | 13         |                | 3         | 10     | less than 1s |
| 9  | Bal           | 437     | 188    | 4          |                | 3         | 10     | less than 1s |
| 10 | Iris          | 105     | 45     | 4          |                | 3         | 10     | less than 1s |
| 11 | Vehicle       | 592     | 254    | 18         |                | 4         | 10     | less than 1s |
| 12 | Breast-Cancer | 479     | 204    | 10         |                | 2         | 10     | less than 1s |
| 13 | Diabetes      | 538     | 230    | 8          |                | 2         | 10     | less than 1s |
| 14 | Twin Peaks    | 14,000  | 6,000  | 3          |                | 11        | 10     | 294s         |
| 15 | Helix         | 14,000  | 6,000  | 3          |                | 7         | 10     | 249s         |

Experimental results are obtained by averaging over 10 runs (except USPS-1). We randomly split the
datasets for each run. We have used the same mechanism to generate training triplets as described
in [7]. Briefly, for each training point $\mathbf{a}_i$, the $k$ nearest neighbors that have the same
label as $y_i$ (targets), as well as the $k$ nearest neighbors that have different labels from $y_i$
(impostors) are found. We then construct triplets from $\mathbf{a}_i$ and its corresponding targets and
impostors. For all the datasets, we have set $k = 3$ except that $k = 1$ for datasets USPS-1, ORLFace-1
and ORLFace-2 due to their large size. We have compared our method against a few methods: Xing et
al. [4], RCA [5], NCA [6] and LMNN [7]. LMNN is one of the state-of-the-art methods according to recent
studies such as [15]. Also in Table 2, "Euclidean" is the baseline algorithm that uses the standard
Euclidean distance. The codes for these compared algorithms are downloaded from the corresponding
authors' websites. We have released our codes for BoostMetric at [16]. Experiment setting for LMNN
follows [7]. For BoostMetric, we have set $v = 10^{-7}$, the maximum number of iterations $J = 500$.
From Table 2, we conclude: (1) BoostMetric consistently improves $k$NN classification using Euclidean
distance on most datasets. So learning a Mahalanobis metric based upon the large margin concept does
lead to improvements in $k$NN classification. (2) BoostMetric outperforms other algorithms in most
cases (on 11 out of 15 datasets). LMNN is the second best algorithm on these 15 datasets statistically.
LMNN's results are consistent with those given in [7]. (3) Xing et al. [4] and NCA can only handle a
few small datasets. In general they do not perform very well. A good initialization is important for
NCA because NCA's cost function is non-convex and can only find a local optimum.
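The triplet construction described above can be sketched as follows. The sketch pairs every target of a point with every impostor of that point, which is our reading of the mechanism in [7] and should be taken as an assumption; it also forms the full pairwise distance matrix, which is only practical for small training sets. The six points and their labels are made up.

```python
import numpy as np


def generate_triplets(points, labels, k=3):
    """Return index triplets (i, j, l): j is one of the k nearest same-label
    neighbors of i (a target) and l is one of the k nearest different-label
    neighbors of i (an impostor), under the Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(points)
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    triplets = []
    for i in range(n):
        same = np.flatnonzero((labels == labels[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(labels != labels[i])
        targets = same[np.argsort(sq[i, same])[:k]]
        impostors = diff[np.argsort(sq[i, diff])[:k]]
        # Assumption: every (target, impostor) pair of point i yields a triplet.
        triplets.extend((i, int(j), int(l)) for j in targets for l in impostors)
    return triplets


points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = [0, 0, 0, 1, 1, 1]
triplets = generate_triplets(points, labels, k=1)
print(triplets)
```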

Influence of $v$ Previously, we claimed that our algorithm is parameter-free like AdaBoost. However,
we do have a parameter $v$ in BoostMetric. Actually, AdaBoost simply sets $v = 0$. The coordinate-wise
gradient descent optimization strategy of AdaBoost leads to an $\ell_1$-norm regularized maximum
margin classifier [17]. It is shown that AdaBoost minimizes its loss criterion with an $\ell_1$
constraint on the coefficient vector. Given the similarity of the optimization of BoostMetric with
AdaBoost, we conjecture that BoostMetric has the same property. Here we show empirically that as long
as $v$ is sufficiently small, the final performance is not affected by the value of $v$. We have set $v$
from $10^{-8}$ to $10^{-4}$ and run BoostMetric on 3 UCI datasets. Table 3 reports the final 3NN
classification error with different $v$. The results are nearly identical.

Computational time As we discussed, one major issue in learning a Mahalanobis distance is 
heavy computational cost because of the semidefiniteness constraint. 



¹ http://www1.cs.columbia.edu/CAVE/software/softlib/coil-20.php
² http://archive.ics.uci.edu/ml/
³ http://boosting.googlecode.com/files/dataset1.tar.bz2



Table 2: Test classification error rates (%) of a 3-nearest neighbor classifier on benchmark datasets.
Results of NCA and Xing et al. [4] on large datasets are not available either because the algorithm does
not converge or due to the out-of-memory problem.

|    | dataset       | Euclidean    | Xing et al. [4] | RCA          | NCA          | LMNN         | BoostMetric  |
|----|---------------|--------------|-----------------|--------------|--------------|--------------|--------------|
| 1  | USPS-1        | 5.18         |                 | 32.71        |              | 7.51         | 2.96         |
| 2  | USPS-2        | 3.56 (0.28)  |                 | 5.57 (0.33)  |              | 2.18 (0.27)  | 1.99 (0.24)  |
| 3  | ORLFace-1     | 3.33 (1.47)  |                 | 5.75 (2.85)  | 3.92 (2.01)  | 6.67 (2.94)  | 2.00 (1.05)  |
| 4  | ORLFace-2     | 5.33 (2.70)  |                 | 4.42 (2.08)  | 3.75 (1.63)  | 2.83 (1.77)  | 3.00 (1.31)  |
| 5  | MNIST         | 4.11 (0.43)  |                 | 4.31 (0.42)  |              | 4.19 (0.49)  | 4.09 (0.31)  |
| 6  | COIL20        | 0.19 (0.21)  |                 | 0.32 (0.29)  |              | 2.41 (1.80)  | 0.02 (0.07)  |
| 7  | Letters       | 5.74 (0.24)  |                 | 5.06 (0.26)  |              | 4.34 (0.36)  | 3.54 (0.18)  |
| 8  | Wine          | 26.23 (5.52) | 10.38 (4.81)    | 2.26 (1.95)  | 27.36 (6.31) | 5.47 (3.01)  | 2.64 (1.59)  |
| 9  | Bal           | 18.13 (1.79) | 11.12 (2.12)    | 19.47 (2.39) | 4.81 (1.80)  | 11.87 (2.14) | 8.93 (2.28)  |
| 10 | Iris          | 2.22 (2.10)  | 2.22 (2.10)     | 3.11 (2.15)  | 2.89 (2.58)  | 2.89 (2.58)  | 2.89 (2.78)  |
| 11 | Vehicle       | 30.47 (2.41) | 28.66 (2.49)    | 21.42 (2.46) | 22.61 (3.26) | 22.57 (2.16) | 19.17 (2.10) |
| 12 | Breast-Cancer | 3.28 (1.06)  | 3.63 (0.93)     | 3.82 (1.15)  | 4.31 (1.10)  | 3.19 (1.43)  | 2.45 (0.95)  |
| 13 | Diabetes      | 27.43 (2.93) | 27.87 (2.71)    | 26.48 (1.61) | 27.61 (1.55) | 26.78 (2.42) | 25.04 (2.25) |
| 14 | Twin Peaks    | 1.13 (0.09)  |                 | 1.02 (0.09)  |              | 0.98 (0.11)  | 0.14 (0.08)  |
| 15 | Helix         | 0.60 (0.12)  |                 | 0.61 (0.11)  |              | 0.61 (0.13)  | 0.58 (0.12)  |



Table 3: Test error (%) of a 3-nearest neighbor classifier with different values of the parameter v. Each 
experiment is run 10 times. We report the mean and variance. As expected, as long as v is sufficiently small, in 
a wide range it almost does not affect the final classification performance. 
We have shown the running time of the proposed algorithm in Table 1 for the classification tasks.
Our algorithm is generally fast. It involves matrix operations and an EVD for finding its largest
eigenvalue and its corresponding eigenvector. The time complexity of this EVD is $O(D^2)$ with $D$ the
input dimensions. We compare our algorithm's running time with LMNN in Fig. 2 on the artificial
dataset (concentric circles). We vary the input dimensions from 50 to 1000 and keep the number
of triplets fixed to 250. Instead of using standard interior-point SDP solvers that do not scale well,
LMNN heuristically combines sub-gradient descent in both the matrices $\mathbf{L}$ and $\mathbf{X}$.
At each iteration, $\mathbf{X}$ is projected back onto the p.s.d. cone using EVD. So a full EVD with
time complexity $O(D^3)$ is needed. Note that LMNN is much faster than SDP solvers like CSDP [18]. As
seen from Fig. 2, when the input dimensions are low, BoostMetric is comparable to LMNN. As expected,
when the input dimensions become high, BoostMetric is significantly faster than LMNN. Note that our
implementation is in Matlab. Improvements are expected if implemented in C/C++.
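The difference between the two per-iteration costs can be illustrated with a small sketch. The projection onto the p.s.d. cone needs every eigenpair, whereas the leading eigenpair alone can be approximated with repeated matrix-vector products, each costing $O(D^2)$. The shifted power iteration below is a stand-in chosen for brevity, not the solver used in our experiments, and the test matrix with a known spectrum is made up.

```python
import numpy as np


def project_to_psd_cone(X):
    """Nearest p.s.d. matrix in Frobenius norm; uses every eigenpair."""
    eigvals, eigvecs = np.linalg.eigh(0.5 * (X + X.T))
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T


def top_eigenpair(A, iters=500, seed=0):
    """Largest (algebraic) eigenpair of a symmetric matrix by shifted power
    iteration. The shift makes all eigenvalues of A + shift * I nonnegative."""
    shift = np.linalg.norm(A)                  # Frobenius norm bounds |eigenvalues|
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x + shift * x                  # one O(D^2) matrix-vector product
        x /= np.linalg.norm(x)
    return float(x @ A @ x), x


# Symmetric test matrix with known eigenvalues 5, 1, 0, -1, -3.
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((5, 5)))
A = (Q * np.array([5.0, 1.0, 0.0, -1.0, -3.0])) @ Q.T

lam, xi = top_eigenpair(A)
P = project_to_psd_cone(A)
print(lam, np.linalg.eigvalsh(P))
```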



4 We have run all the experiments on a desktop with an Intel Core™2 Duo CPU, 4G RAM and Matlab 7.7 
(64-bit version). 
Figure 2: Computation time of the proposed BoostMetric and the LMNN method versus the input data's
dimensions on an artificial dataset. BoostMetric is faster than LMNN with large input dimensions
because at each iteration BoostMetric only needs to calculate the largest eigenvector and LMNN needs
a full eigen-decomposition.



[Figure 3 plots. Legend: Euclidean, LMNN, BoostMetric. Horizontal axis of the second plot: number of triplets (1000 to 9000).]



Figure 3: Test error (3-nearest neighbor) of BoostMetric on the Motorbikes vs. Airplanes datasets. The
second figure shows the test error against the number of training triplets with a 100-word codebook.
Test error of LMNN is 4.7% ± 0.5% with 8631 triplets for training, which is worse than BoostMetric. For
Euclidean distance, the error is much larger: 15% ± 1%.



3.3 Visual Object Categorization and Detection 

The proposed BoostMetric and the LMNN are further compared on four classes of the Caltech- 
101 object recognition database [19], including Motorbikes (798 images), Airplanes (800), Faces 
(435), and Background-Google (520). For each image, a number of interest regions are identified 
by the Harris-affine detector [20] and the visual content in each region is characterized by the SIFT 
descriptor [21]. The total numbers of local descriptors extracted from the images of the four classes
are about 134,000, 84,000, 57,000, and 293,000, respectively. This experiment includes both object
categorization (Motorbikes vs. Airplanes) and object detection (Faces vs. Background-Google)
problems. To accumulate statistics, the images of two involved object classes are randomly split as
10 pairs of training/test subsets. Restricted to the images in a training subset (those in a test subset
are only used for test), their local descriptors are clustered to form visual words by using $k$-means
clustering. Each image is then represented by a histogram containing the number of occurrences of 
each visual word. 
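The histogram representation can be sketched as follows: each local descriptor is assigned to its nearest visual word and the assignments are counted. The two-dimensional "descriptors" and the three-word codebook are made-up stand-ins for SIFT descriptors and a $k$-means codebook, and the function name is ours.

```python
import numpy as np


def bow_histogram(descriptors, codebook):
    """Count, for each visual word, the descriptors whose nearest word it is."""
    d2 = np.sum((descriptors[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    words = np.argmin(d2, axis=1)              # nearest visual word per descriptor
    return np.bincount(words, minlength=len(codebook))


codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])     # made-up centers
descriptors = np.array([[0.1, -0.2], [9.5, 10.2], [0.3, 0.1], [10.1, 9.9]])
hist = bow_histogram(descriptors, codebook)
print(hist)   # [2 2 0]
```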

Motorbikes vs. Airplanes This experiment discriminates the images of a motorbike from those 
of an airplane. In each of the 10 pairs of training/test subsets, there are 959 training images and 
639 test images. Two visual codebooks of size 100 and 200 are used, respectively. With the result- 
ing histograms, the proposed BoostMetric and the LMNN are learned on a training subset and 
evaluated on the corresponding test subset. Their averaged classification error rates are compared 
in Fig. 3 (left). For both visual codebooks, the proposed BoostMetric achieves lower error rates 
than the LMNN and the Euclidean distance, demonstrating its superior performance. We also apply 
a linear SVM classifier with its regularization parameter carefully tuned by 5-fold cross-validation. 
Its error rates are 3.87% ± 0.69% and 3.00% ± 0.72% on the two visual codebooks, respectively. In 
contrast, a 3NN with BoostMetric has error rates 3.63% ± 0.68% and 2.96% ± 0.59%. Hence,
the performance of the proposed BoostMetric is comparable to or even slightly better than the 
SVM classifier. Also, Fig. 3 (right) plots the test error of the BoostMetric against the number of 
triplets for training. The general trend is that more triplets lead to smaller errors. 

Faces vs. Background-Google This experiment uses the two object classes as a retrieval prob- 
lem. The target of retrieval is the face images. The images in the class of Background-Google are 
randomly collected from the Internet and they are used to represent the non-target class. Boost- 
Metric is first learned from a training subset and retrieval is conducted on the corresponding test 
subset. In each of the 10 training/test subsets, there are 573 training images and 382 test images. 
Again, two visual codebooks of size 100 and 200 are used. Each face image in a test subset is used 
as a query, and its distances from other test images are calculated by BoostMetric, LMNN and 
the Euclidean distance. For each metric, the precision of the retrieved top 5, 10, 15 and 20 images
is computed. The retrieval precision for each query is averaged on this test subset and then averaged
over the whole 10 test subsets. We report the retrieval precision in Fig. 4 (with a codebook
size 100). As shown, BoostMetric consistently attains the highest values, which again verifies its 
advantages over LMNN and the Euclidean distance. With a codebook size 200, very similar results 
are obtained. 
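The retrieval precision for a single query can be sketched as below: the gallery is ranked by the squared Mahalanobis distance under a given $\mathbf{X}$, and the fraction of relevant images among the top $k$ is reported. The one-dimensional gallery, the relevance flags and the cut-offs are made up for the example; the function name is hypothetical.

```python
import numpy as np


def precision_at_k(X, query, gallery, relevant, ks=(5, 10, 15, 20)):
    """Precision of the top-k retrieved items under the metric X.

    `relevant` is a boolean array marking gallery items of the target class."""
    diffs = gallery - query
    d2 = np.einsum("nd,de,ne->n", diffs, X, diffs)   # squared Mahalanobis distances
    order = np.argsort(d2, kind="stable")            # nearest first
    return {k: float(np.mean(relevant[order[:k]])) for k in ks}


gallery = np.array([[1.0], [2.0], [3.0], [4.0]])
relevant = np.array([True, False, True, True])
query = np.array([0.0])
prec = precision_at_k(np.eye(1), query, gallery, relevant, ks=(1, 2, 4))
print(prec)   # {1: 1.0, 2: 0.5, 4: 0.75}
```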
Figure 4: Retrieval accuracy of distance metric learning algorithms on the Faces versus Background-Google
datasets. (top: input dimension 100; bottom: input dimension 200). Error bars show the standard deviation.



4 Conclusion 



We have presented a new algorithm, BoostMetric, to learn a positive semidefinite metric using 
boosting techniques. We have generalized AdaBoost in the sense that the weak learner of Boost- 
Metric is a matrix, rather than a classifier. Our algorithm is simple and efficient. Experiments 
show its better performance over a few state-of-the-art existing metric learning methods. We are 
currently combining the idea of on-line learning into BoostMetric to make it handle even larger 
datasets. 